Gradient Boosting

Before moving forward with the to-do list, let’s throw Gradient Boosting at the problem.

For many reasons, Random Forest usually makes a very good baseline model. In this particular case I started with polynomial OLS as the baseline instead, simply because the correlations made it evident that the relationship between temperature and consumption follows a polynomial shape. But let’s now turn to a beloved tree ensemble and try Gradient Boosting.
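The post doesn’t show the fitting code, so here is a minimal sketch of the vanilla pipeline reported under Model details below. `ColumnSelector` is written out as a stand-in for the project’s transformer of the same name, and `X_train`/`y_train` are assumed to hold the weather features and the consumption target.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Stand-in for the project's ColumnSelector: keep only the given columns."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.columns]


pipe = Pipeline(steps=[
    ("vars", ColumnSelector(columns=["tt_tu_mean", "rf_tu_mean", "td_mean",
                                     "vp_std_mean", "tf_std_mean"])),
    ("model", GradientBoostingRegressor(random_state=7)),
])
pipe.fit(X_train, y_train)  # X_train/y_train: assumed weather features and consumption
```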

Model Cards provide a framework for transparent, responsible reporting.
Use the vetiver `.qmd` Quarto template as a place to start, with vetiver.model_card()
Writing pin:
Name: 'wd-gb'
Version: 20251114T112905Z-9fee3
♻️  stepit 'gb_raw': is up-to-date. Using cached result for `strom.modelling.assess_model()` 2025-11-14 11:29:05
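The pin message above comes from vetiver. As a hedged sketch, writing such a pin could look like this; the board backend is an assumption, since the post doesn’t say where the pin is stored.

```python
import pins
from vetiver import VetiverModel, vetiver_pin_write

board = pins.board_temp(versioned=True, allow_pickle_read=True)  # assumed backend
v = VetiverModel(pipe, model_name="wd-gb")
vetiver_pin_write(board, v)  # prints the model-card hint and the pin version seen above
```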

Metrics

| Metric | Single split: train | Single split: test | CV: test | CV: train |
|--------|--------------------:|-------------------:|---------:|----------:|
| MAE - Mean Absolute Error | 1.299202 | 2.181381 | 2.041268 | 1.249593 |
| MSE - Mean Squared Error | 3.242307 | 15.348284 | 9.478075 | 2.866359 |
| RMSE - Root Mean Squared Error | 1.800641 | 3.917689 | 2.892303 | 1.692928 |
| R2 - Coefficient of Determination | 0.966331 | 0.813694 | -0.994445 | 0.970756 |
| MAPE - Mean Absolute Percentage Error | 0.125487 | 0.214096 | 0.358069 | 0.100387 |
| EVS - Explained Variance Score | 0.966331 | 0.820948 | -0.289647 | 0.970756 |
| MeAE - Median Absolute Error | 0.947582 | 1.394828 | 1.519012 | 0.969966 |
| D2 - D2 Absolute Error Score | 0.817784 | 0.668278 | -0.462412 | 0.821892 |
| Pinball - Mean Pinball Loss | 0.649601 | 1.090690 | 1.020634 | 0.624796 |
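All of these metrics ship with scikit-learn. A sketch of how the single-split test column could be reproduced, assuming `y_test` and the fitted `pipe` from above; note that `mean_pinball_loss` with its default `alpha=0.5` is exactly half the MAE, which matches the table.

```python
import numpy as np
from sklearn import metrics

y_pred = pipe.predict(X_test)  # X_test/y_test: assumed single-split hold-out
report = {
    "MAE": metrics.mean_absolute_error(y_test, y_pred),
    "MSE": metrics.mean_squared_error(y_test, y_pred),
    "RMSE": np.sqrt(metrics.mean_squared_error(y_test, y_pred)),
    "R2": metrics.r2_score(y_test, y_pred),
    "MAPE": metrics.mean_absolute_percentage_error(y_test, y_pred),
    "EVS": metrics.explained_variance_score(y_test, y_pred),
    "MeAE": metrics.median_absolute_error(y_test, y_pred),
    "D2": metrics.d2_absolute_error_score(y_test, y_pred),
    "Pinball": metrics.mean_pinball_loss(y_test, y_pred),  # alpha=0.5, i.e. MAE / 2
}
```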

Scatter plot matrix

Observed vs. Predicted and Residuals vs. Predicted

Check the residuals to assess the goodness of fit (a plotting sketch follows the list):

  • Are they white noise, or is there a pattern?
  • Is there heteroscedasticity?
  • Is there non-linearity?
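A minimal matplotlib sketch of the two panels, assuming `y_test` and `y_pred` from the split above:

```python
import matplotlib.pyplot as plt

resid = y_test - y_pred
fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.scatter(y_pred, y_test, alpha=0.5)
ax1.axline((0, 0), slope=1, color="grey")   # perfect-fit line
ax1.set(xlabel="Predicted", ylabel="Observed", title="Observed vs. Predicted")
ax2.scatter(y_pred, resid, alpha=0.5)
ax2.axhline(0, color="grey")                # residuals should hug zero
ax2.set(xlabel="Predicted", ylabel="Residual", title="Residuals vs. Predicted")
plt.show()
```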

Normality of Residuals

Check whether the residuals are normally distributed (see the Q-Q plot sketch below).
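A quick way to eyeball this is a Q-Q plot of the residuals against a normal distribution; a sketch with scipy:

```python
import matplotlib.pyplot as plt
from scipy import stats

resid = y_test - y_pred
fig, ax = plt.subplots()
stats.probplot(resid, dist="norm", plot=ax)  # points on the line: roughly normal
ax.set_title("Q-Q plot of residuals")
plt.show()
```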

Leverage

Scale-Location plot

Residuals Autocorrelation Plot
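A sketch of the autocorrelation check with statsmodels, reusing the residuals from above; significant spikes at low lags would mean the model leaves temporal structure on the table.

```python
from statsmodels.graphics.tsaplots import plot_acf

# Spikes outside the confidence band indicate autocorrelated errors.
plot_acf(resid, lags=30)
```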

Residuals vs Time

Again, the model overfits heavily: on the single split the train R2 is about 0.97 against a test R2 of 0.81, and under cross-validation the test R2 is even negative.

Tuning curves, one panel per searched parameter: param_model__learning_rate, param_model__max_depth, param_model__min_samples_leaf, param_model__min_samples_split, param_model__n_estimators, param_model__subsample, param_vars__columns.
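The panels correspond to a grid search over the pipeline’s parameters. The exact grids are not shown in the post, so the values below are assumptions; only the parameter names (and the winning values in the next section) are taken from the output.

```python
from sklearn.model_selection import GridSearchCV

param_grid = {  # assumed grids; only the winning values are reported below
    "vars__columns": [
        ["tt_tu_mean"],
        ["tt_tu_mean", "rf_tu_mean", "td_mean", "vp_std_mean", "tf_std_mean"],
    ],
    "model__learning_rate": [0.01, 0.05, 0.1],
    "model__max_depth": [3, 5, 7],
    "model__min_samples_leaf": [1, 5, 10],
    "model__min_samples_split": [2, 24, 48],
    "model__n_estimators": [60, 100, 200],
    "model__subsample": [0.5, 0.8, 1],
}
search = GridSearchCV(pipe, param_grid, scoring="neg_mean_absolute_error", cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```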

Best model

{'model__learning_rate': 0.1,
 'model__max_depth': 5,
 'model__min_samples_leaf': 5,
 'model__min_samples_split': 48,
 'model__n_estimators': 60,
 'model__subsample': 1,
 'vars__columns': ['tt_tu_mean']}
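This matches the tuned pipeline shown under Model details. As a sketch (assuming the hypothetical `search` object above), the winning configuration can be pushed back into the pipeline:

```python
# GridSearchCV with refit=True already exposes this as search.best_estimator_;
# set_params is just the explicit route.
gb_tuned = pipe.set_params(**search.best_params_).fit(X_train, y_train)
```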
♻️  stepit 'gb_tuned': is up-to-date. Using cached result for `strom.modelling.assess_model()` 2025-11-14 11:29:12

Metrics

| Metric | Single split: train | Single split: test | CV: test | CV: train |
|--------|--------------------:|-------------------:|---------:|----------:|
| MAE - Mean Absolute Error | 1.652788 | 2.133685 | 1.871308 | 1.705797 |
| MSE - Mean Squared Error | 6.166925 | 13.920034 | 7.682780 | 6.219943 |
| RMSE - Root Mean Squared Error | 2.483329 | 3.730956 | 2.587570 | 2.490190 |
| R2 - Coefficient of Determination | 0.935960 | 0.831031 | -0.662411 | 0.936311 |
| MAPE - Mean Absolute Percentage Error | 0.149954 | 0.226310 | 0.342241 | 0.127604 |
| EVS - Explained Variance Score | 0.935960 | 0.837099 | -0.135969 | 0.936311 |
| MeAE - Median Absolute Error | 1.020238 | 1.471119 | 1.538689 | 1.130377 |
| D2 - D2 Absolute Error Score | 0.768192 | 0.675531 | -0.399028 | 0.756513 |
| Pinball - Mean Pinball Loss | 0.826394 | 1.066843 | 0.935654 | 0.852898 |

Scatter plot matrix

Observed vs. Predicted and Residuals vs. Predicted

Check the residuals to assess the goodness of fit:

  • Are they white noise, or is there a pattern?
  • Is there heteroscedasticity?
  • Is there non-linearity?

Normality of Residuals

Check whether the residuals are normally distributed.

Leverage

Scale-Location plot

Residuals Autocorrelation Plot

Residuals vs Time

Compare vanilla vs. tuned

Cross-validation messages

♻️  stepit 'cross_validate_pipe': is up-to-date. Using cached result for `strom.modelling.cross_validate_pipe()` 2025-11-14 11:29:15

♻️  stepit 'cross_validate_pipe': is up-to-date. Using cached result for `strom.modelling.cross_validate_pipe()` 2025-11-14 11:29:15
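`strom.modelling.cross_validate_pipe()` is the project’s own helper and its code is not shown; a minimal sketch of the underlying idea with scikit-learn’s `cross_validate`, assuming `X` and `y` hold the full feature matrix and target:

```python
from sklearn.model_selection import cross_validate

# return_train_score=True allows the train-vs-test comparison
# used in the metrics tables above.
cv_res = cross_validate(
    pipe, X, y, cv=5,
    scoring=["neg_mean_absolute_error", "r2"],
    return_train_score=True,
)
```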

Metrics

Single split

Metrics based on the test set of the single split

Cross validation

Predictions, residuals, observed


Time vs. Predicted and Observed

Time vs. Residuals

Model details

Vanilla:
Pipeline(steps=[('vars',
                 ColumnSelector(columns=['tt_tu_mean', 'rf_tu_mean', 'td_mean',
                                         'vp_std_mean', 'tf_std_mean'])),
                ('model', GradientBoostingRegressor(random_state=7))])

Tuned:

Pipeline(steps=[('vars', ColumnSelector(columns=['tt_tu_mean'])),
                ('model',
                 GradientBoostingRegressor(max_depth=5, min_samples_leaf=5,
                                           min_samples_split=48,
                                           n_estimators=60, random_state=7,
                                           subsample=1))])

TODOs